Machine learning-based extrachromosomal DNA identification in large-scale cohorts reveals its clinical implications in cancer

The clinical implications of extrachromosomal DNA (ecDNA) in cancer therapy remain largely elusive. Here, we present a comprehensive analysis of ecDNA amplification spectra and their association with clinical and molecular features in multiple cohorts comprising over 13,000 pan-cancer patients. Using our developed computational framework, GCAP, and validating it with multifaceted approaches, we reveal a consistent pan-cancer pattern of mutual exclusivity between ecDNA amplification and microsatellite instability (MSI). In addition, we establish the role of ecDNA amplification as a risk factor and refine genomic subtypes in a cohort of 1,015 colorectal cancer patients. Importantly, our investigation incorporates data from four clinical trials focused on anti-PD-1 immunotherapy, demonstrating the pivotal role of ecDNA amplification as a biomarker for guiding checkpoint blockade immunotherapy in gastrointestinal cancer. This finding represents clinical evidence linking ecDNA amplification to the effectiveness of immunotherapeutic interventions. Overall, our study provides a proof of concept for identifying ecDNA amplification from cancer whole-exome sequencing (WES) data, highlighting the potential of ecDNA amplification as a valuable biomarker for facilitating personalized cancer treatment.

Supplementary Figure 2 | ecDNA cargo gene prediction modeling and performance estimation. a, Cancer type distribution of 386 TCGA cancer patients for modeling. TCGA cancer type abbreviations are explained at https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations. b, Area under precision-recall curve values of exploratory ecDNA cargo gene prediction logistic regression models by different gene-level molecular profiles. c, Receiver operating characteristic curves and corresponding AUC values on training set (90%) and testing set (10%). d, Precision, sensitivity, and specificity trends on training set (90%) and testing set (10%) under different probability thresholds for classification. e, Performance comparison between ecDNA cargo gene prediction models generated from different strategies (logistic regression and XGBOOST) and different feature numbers (11,

Introduction
This supplementary method file provides a detailed account of GCAP (Gene-level Circular Amplicon Prediction), an approach designed to detect extrachromosomal DNA (ecDNA) using whole-exome sequencing (WES) data and absolute copy number profiles. Instead of delving into the background, performance, and validation aspects of GCAP, our focus remains on the practical aspects: describing the data collection and preprocessing methods, outlining the modeling process, and presenting the implementation framework achieved through R packages. Finally, we provide a concise guide on effectively employing the developed R packages.

GCAP modeling for ecDNA amplification identification
The modeling procedure is delineated through the following five sequential steps, with an overarching framework overview depicted in Supplementary Figure 16.
Supplementary Figure 16. Extrachromosomal DNA cargo gene prediction modeling and sample classification workflow.

Step 1. Data collection and preparation
Kim et al. 1 processed whole-genome sequencing (WGS) data from 3,212 tumors sourced from PCAWG and TCGA projects using the AmpliconArchitect 2 software. Each tumor possessing amplicons was categorized into one of four primary classes with a hierarchical order: 'Circular' (ecDNA), 'BFB' (breakage-fusion-bridge), 'HR' (heavily-rearranged), and 'Linear' (linear amplification). Here, the hierarchical order implies precedence: if a tumor exhibits both 'Circular' and 'Linear' amplicons, it is labeled as a 'Circular' tumor. Tumors lacking any amplicons were classified as 'No-fSCNA'. In our study, to streamline the focus on ecDNA prediction modeling, we categorized 'Circular' as 'ecDNA+' and grouped all others as 'ecDNA-'. Furthermore, for enhanced analysis of the data, we also classified 'Circular' as circular amplification, amalgamated 'BFB', 'HR', and 'Linear' into noncircular amplification, and designated the remaining cases as 'nofocal'.
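The labeling scheme described above can be illustrated with a minimal Python sketch. GCAP itself is implemented in R; the function names below are hypothetical and only mirror the precedence order and label groupings stated in the text.

```python
# Precedence of AmpliconArchitect amplicon classes, highest first,
# following the hierarchical order given in the text.
PRECEDENCE = ["Circular", "BFB", "HR", "Linear"]


def tumor_class(amplicon_classes):
    """Return the tumor-level class for the amplicon classes found in one tumor."""
    found = set(amplicon_classes)
    for cls in PRECEDENCE:
        if cls in found:
            return cls
    return "No-fSCNA"  # tumor without any amplicon


def binary_label(cls):
    """Two-class label used for ecDNA prediction modeling."""
    return "ecDNA+" if cls == "Circular" else "ecDNA-"


def three_way_label(cls):
    """Three-class label used for downstream analysis."""
    if cls == "Circular":
        return "circular"
    if cls in ("BFB", "HR", "Linear"):
        return "noncircular"
    return "nofocal"
```

For example, a tumor carrying both 'Linear' and 'Circular' amplicons resolves to 'Circular', and therefore to 'ecDNA+' and 'circular'.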
We gathered the AmpliconArchitect result classes of 3,212 tumors as reported by Kim et al. 1 and subsequently categorized these classes into either 'ecDNA+' or 'ecDNA-'. We excluded tumors without TCGA identifiers (those with sample barcodes not beginning with 'TCGA'). We retrieved the TCGA identifiers and employed them to access the GDC data portal 3 (https://gdc.cancer.gov/) to identify matched TCGA patients with tumor-normal paired WES data. Sample selection was guided by the following considerations: 1) the prevalence of ecDNA-negative tumors; 2) most gene regions in ecDNA+ tumors are ecDNA-negative; 3) AmpliconArchitect might categorize certain ecDNA+ structures as intricate and non-cyclic if breakpoints are overlooked 4 ; 4) the occurrence of BFB cycles can lead to the development of ecDNA, complicating the differentiation between these two modes of amplification 5 ; and 5) gene-level data modeling is resource-intensive, which necessitates stringent control at both gene and sample levels. Accordingly, we retained a total of 326 matched ecDNA+ tumors and randomly selected 30 ecDNA- tumors exhibiting 'Linear' amplification (representing non-ecDNA focal amplification) and 30 'No-fSCNA' ecDNA- tumors (representing no focal amplification), using the R function sample_n in the dplyr package with random seed '2021'.
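The selection step can be sketched as follows. This is an illustrative Python analogue, not the authors' R code: the function name and input layout are assumptions, and Python's random generator will not reproduce the exact tumors drawn by dplyr's sample_n even with the same seed value.

```python
import random


def select_modeling_tumors(tumors, n_per_negative_class=30, seed=2021):
    """Select modeling tumors from (tumor_id, tumor_class) pairs.

    All 'Circular' tumors are kept; 'Linear' and 'No-fSCNA' tumors are
    subsampled without replacement. random.sample raises ValueError if a
    class has fewer tumors than requested.
    """
    rng = random.Random(seed)  # local generator, so the draw is repeatable
    positives = [t for t, c in tumors if c == "Circular"]
    linear = [t for t, c in tumors if c == "Linear"]
    no_fscna = [t for t, c in tumors if c == "No-fSCNA"]
    return {
        "ecDNA+": positives,
        "ecDNA- (Linear)": rng.sample(linear, n_per_negative_class),
        "ecDNA- (No-fSCNA)": rng.sample(no_fscna, n_per_negative_class),
    }
```

With 326 matched 'Circular' tumors as input, the three groups sum to the 386 tumors used for modeling.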
We obtained the TCGA raw WES bam files (aligned by the TCGA team with hg38 as reference) for the modeling process using the GDC data transfer tool, gdc-client (https://gdc.cancer.gov/access-data/gdc-data-transfer-tool). Additionally, we acquired clinical annotations of TCGA patients from UCSC Xena 6 through the UCSCXenaShiny 7 platform.

Step 2. Generation of predictive features and response variable for modeling
We employed ASCAT v3 8 to preprocess the TCGA raw whole-exome sequencing (WES) data using hg38 as the reference genome. This preprocessing yielded essential data, including allele-specific copy number profiles (total_cn and minor_cn), as well as estimations of tumor purity and ploidy. Detailed information about this process can be found in the 'Allele specific copy number calling and feature extraction' section of the Methods in the formal manuscript.
To facilitate copy number calling on HPC platforms and ensure analysis reproducibility, we utilized a specific version of ASCAT (https://github.com/ShixiangWang/ascat/tree/v3.0). Subsequently, we calculated various genomic alteration measures, such as pLOH (percentage of the genome with LOH), AScore (aneuploidy score), and cna_burden (copy number alteration burden), using Sigminer [...] AmpliconArchitect, we obtained the human gene annotation data for hg19 from GENCODE (gencode.v38lift37.annotation.gtf.gz). From this source, we extracted pertinent details for all protein-coding genes. Through the overlap analysis of gene and amplicon regions, we generated the response variable for our modeling, namely, whether an observed gene serves as an ecDNA cargo gene (that is, whether the gene region intersects with a 'Circular' amplicon in a sample). Furthermore, for each gene we calculated the frequency of each of the four amplicon classes as 1,000 times the number of samples in which an amplicon of that class overlaps the gene, divided by 3,212 (the multiplication by 1,000 was employed to establish a suitable value range).
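The overlap analysis and the frequency calculation described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the data layout, the function names, and the closed-interval overlap rule are assumptions, and the quadratic loop is meant for clarity, not genome-scale performance.

```python
from collections import defaultdict

N_TUMORS = 3212  # number of tumors analysed by Kim et al., as stated in the text


def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals on the same chromosome share a base."""
    return a_start <= b_end and b_start <= a_end


def gene_amplicon_hits(genes, amplicons):
    """Yield (sample, gene, amplicon_class) for every gene-amplicon overlap.

    genes: {gene_name: (chrom, start, end)}
    amplicons: iterable of (sample, chrom, start, end, amplicon_class)
    """
    for sample, chrom, start, end, cls in amplicons:
        for gene, (g_chrom, g_start, g_end) in genes.items():
            if chrom == g_chrom and overlaps(start, end, g_start, g_end):
                yield sample, gene, cls


def label_cargo_genes(genes, amplicons):
    """Response variable: (sample, gene) pairs overlapping a 'Circular' amplicon."""
    return {(s, g) for s, g, cls in gene_amplicon_hits(genes, amplicons)
            if cls == "Circular"}


def class_frequency(genes, amplicons, n_tumors=N_TUMORS):
    """Per-gene class frequency: 1000 * (samples with an overlap) / n_tumors."""
    samples = defaultdict(set)  # (gene, class) -> distinct samples
    for s, g, cls in gene_amplicon_hits(genes, amplicons):
        samples[(gene_key := (g, cls))].add(s)
    return {key: 1000 * len(s) / n_tumors for key, s in samples.items()}
```

Counting distinct samples (a set per gene and class) ensures that a sample with several overlapping amplicons of the same class contributes only once to the frequency.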
Through the integration of the aforementioned seven features, along with the frequency of four

Step 3. Modeling and hyper-parameter searching
To address imbalanced modeling data and prevent information leakage across the train-test division 11 , we adopted the area under the precision-recall curve (auPRC) as our evaluation metric and implemented a group k-fold strategy, with k being 10 by default. We also used k values of 3, 5, 10, and 20 for specific purposes, such as comparison.
To elaborate, all cancer patients earmarked for modeling were sorted based on the number of ecDNA cargo genes, and subsequent random sampling was conducted. This resulted in a distribution of 90% of samples for the training fold and 10% for the testing fold. Furthermore, all gene-level observations linked to each cancer patient were assigned to the respective fold. In other words, observations from the same cancer patient were allocated exclusively to either the training or testing fold, but not both.
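A patient-grouped split of this kind can be sketched in Python as below. The exact sampling procedure is not fully specified in the text, so the block-wise assignment used here (sort patients by cargo gene count, then randomly spread each consecutive block of k patients over the k folds) is an assumption, as are the function names and the default seed.

```python
import random


def assign_patient_folds(cargo_gene_counts, k=10, seed=1):
    """Map each patient to a fold index in 0..k-1.

    cargo_gene_counts: {patient_id: number of ecDNA cargo genes}
    Patients are sorted by cargo gene count, and every consecutive block
    of k patients is distributed randomly over distinct folds, so that
    folds receive similar distributions of cargo gene counts.
    """
    rng = random.Random(seed)
    ordered = sorted(cargo_gene_counts, key=lambda p: (cargo_gene_counts[p], p))
    folds = {}
    for i in range(0, len(ordered), k):
        block = ordered[i:i + k]
        fold_ids = list(range(k))
        rng.shuffle(fold_ids)
        for patient, fold in zip(block, fold_ids):
            folds[patient] = fold
    return folds


def split_rows(rows, folds, test_fold):
    """Split gene-level rows so that each patient falls entirely on one side.

    rows: list of dicts with a 'patient' key.
    """
    train = [r for r in rows if folds[r["patient"]] != test_fold]
    test = [r for r in rows if folds[r["patient"]] == test_fold]
    return train, test
```

Because the fold is assigned per patient and rows inherit it, no patient contributes gene-level observations to both the training and the testing side; with k = 10, each testing fold holds about 10% of patients.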
In line with these arrangements, we fine-tuned the XGBOOST model using 10-fold cross-validation data, with a search space of 1,000 randomly sampled hyperparameter combinations.
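For reference, the evaluation metric used during tuning, auPRC, can be estimated from predicted probabilities with the average-precision sum shown in the Python sketch below. This is one common estimator of the area under the precision-recall curve and is given as an illustration only; the text does not state which estimator the authors' R code used.

```python
def average_precision(y_true, y_score):
    """Estimate the area under the precision-recall curve.

    y_true: iterable of 0/1 labels; y_score: predicted probabilities.
    Observations are ranked by decreasing score, and the precision at each
    positive observation is averaged. Tied scores keep their input order.
    """
    y_true = list(y_true)
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    true_positives = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            total += true_positives / rank  # precision at this recall step
    return total / n_pos
```

Unlike the area under the ROC curve, this quantity does not reward the large number of correctly ranked negatives, which is why it suits data in which ecDNA cargo genes are rare.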
Drawing guidance from the XGBOOST documentation (https://xgboost.readthedocs.io/en/stable/tutorials/param_tuning.html) and based on our preliminary investigations, we conducted hyperparameter tuning through a random search approach with R code provided below.
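The general shape of such a random search is illustrated by the following Python sketch, which is not the authors' R code. The parameter names are standard XGBoost parameters, but the ranges, the seed, and the function names are hypothetical, and the scoring function (mean cross-validated auPRC for one parameter set) is left to the caller.

```python
import random

# Hypothetical search space: the keys are XGBoost parameter names,
# the ranges are illustrative and not taken from the study.
SEARCH_SPACE = {
    "eta": lambda rng: rng.uniform(0.01, 0.3),
    "max_depth": lambda rng: rng.randint(3, 10),
    "min_child_weight": lambda rng: rng.randint(1, 10),
    "subsample": lambda rng: rng.uniform(0.5, 1.0),
    "colsample_bytree": lambda rng: rng.uniform(0.5, 1.0),
    "gamma": lambda rng: rng.uniform(0.0, 5.0),
}


def sample_candidates(n=1000, seed=2021):
    """Draw n independent hyperparameter combinations from SEARCH_SPACE."""
    rng = random.Random(seed)
    return [{name: draw(rng) for name, draw in SEARCH_SPACE.items()}
            for _ in range(n)]


def random_search(candidates, cv_score):
    """Return (best_params, best_score) under a caller-supplied scorer.

    cv_score(params) should return the mean cross-validated auPRC
    obtained with those parameters; higher is better.
    """
    scored = [(cv_score(params), params) for params in candidates]
    best_score, best_params = max(scored, key=lambda item: item[0])
    return best_params, best_score
```

Random search evaluates a fixed budget of independent draws (1,000 in the text) instead of a full grid, which keeps the cost predictable when several parameters are tuned jointly.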

Focal amplification-based gene and sample classification
To distinguish between extrachromosomal DNA amplification and chromosomal DNA amplification, the prediction probabilities of ecDNA cargo gene (obtained from the XGBOOST model prediction) were utilized to classify various focal amplification subtypes according to the following guidelines. As a prerequisite, focal amplification should lead to at least a four-copy increase beyond tumor ploidy, in line with the previously outlined criteria by Kim et al. 1. Subsequently, the gene classification process unfolded as follows:
1. A gene was assigned to the "nofocal" category (indicating no focal amplification detected) if its total copy number was less than tumor ploidy + 4 copies.
2. A gene was categorized as "circular" (indicating extrachromosomal DNA amplification) if the associated probability exceeded 0.5.
3. A gene was designated as "noncircular" (indicating chromosomal DNA amplification) if the probability was less than 0.5.
Based on these gene classifications, the categorization of a tumor was established by identifying the predominant focal amplification type within that tumor. To enhance the precision of classification, as a default criterion, a 'circular' tumor was required to possess at least one predicted ecDNA cargo gene with a probability exceeding 0.6.
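The gene-level rules and the default tumor-level criterion can be summarized in a short Python sketch. It is illustrative only and rests on labeled assumptions: circular calls take precedence over noncircular ones at the tumor level, a probability of exactly 0.5 is treated as noncircular, and a tumor whose circular genes all fall at or below the 0.6 cutoff is reported as noncircular. The function names are hypothetical and do not correspond to the GCAP package API.

```python
def classify_gene(total_cn, ploidy, prob, min_gain=4, prob_cutoff=0.5):
    """Classify one gene as 'nofocal', 'circular', or 'noncircular'."""
    if total_cn < ploidy + min_gain:
        return "nofocal"  # fewer than four copies above tumor ploidy
    if prob > prob_cutoff:
        return "circular"
    # Assumption: a probability exactly at the cutoff counts as noncircular.
    return "noncircular"


def classify_tumor(genes, ploidy, tumor_prob_cutoff=0.6):
    """Classify a tumor from its genes.

    genes: list of (total_cn, prob) pairs for one tumor.
    """
    calls = [(classify_gene(cn, ploidy, p), p) for cn, p in genes]
    # A 'circular' tumor needs at least one circular gene above the cutoff.
    if any(c == "circular" and p > tumor_prob_cutoff for c, p in calls):
        return "circular"
    # Assumption: any remaining focal amplification makes the tumor noncircular.
    if any(c != "nofocal" for c, _ in calls):
        return "noncircular"
    return "nofocal"
```

For a tumor with ploidy 2, a gene at 10 copies with probability 0.9 makes the tumor 'circular', whereas the same gene at probability 0.55 is a circular gene call that does not pass the tumor-level cutoff.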
The classification prioritized circular amplification, as per the approach described by Kim et al. 1. For instance, a tumor showcasing both circular and noncircular amplifications would be categorized as "circular," while a tumor with solely noncircular amplification would be labeled as "noncircular." In cases where no focal amplifications were evident (i.e., all investigated genes were assigned to the "nofocal" category), the tumor classification would be "nofocal." The definitions of 'circular' and 'nofocal' remain consistent with those outlined in Kim et al.'s study. Unlike the approach taken by Kim et al., we did not refine chromosomal amplicons (i.e., 'noncircular') stemming from breakage-fusion-bridge (BFB) and heavily-rearranged (HR) mechanisms. This was primarily due to the challenge of characterizing and distinguishing them from linear amplifications (Linear) based solely on WES data.

R package GCAP and command-line interfaces
We have encapsulated our constructed XGBOOST models and realized comprehensive end-to-end analysis workflows within an R package named GCAP. GCAP is accessible for academic use free of charge and can be obtained from https://github.com/ShixiangWang/gcap.
To optimize the usability of GCAP as a prototypical bioinformatics pipeline intended for operation within a Linux command line environment, we have designed two command-line interfaces (CLIs) using the R package GetoptLong (https://github.com/jokergoo/GetoptLong). These CLIs, named gcap-bam.R

32 and 56) by different k-fold (k is 3, 5, 10, 20, respectively) cross-validations in 3 repeats. 11 features represent a basic feature set, 32 features represent the 11 features plus 19 copy number signatures, and 56 features represent the 32 features plus 24 cancer types the modeling tumor samples belong to.